1 Design of the experiment
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| AUX | 25 | 8 | 9 | 11 | 5 |
| UNT | 20 | 6 | 8 | 7 | 18 |
2 Defining samples’ genotype and sex
Based on the plot above, I decide to proceed as follows:
- assigning genotype based on threshold on raw sum of reads mapping to AID2xMycFLAG exogenous sequence: sample is defined as wild-type if sum of AID reads <72
- assigning sex based on threshold on normalized counts of chrY-mapping gene Ddx3y (inspired by Groff et al., Genome Research 2019): sample is defined as Male if Ddx3y counts >30
- excluding sample A60 as sex-uncertain
3 Quality assessment and sample clustering
After applying the log2(n + 1) transformation, I plot heatmaps of normalized counts for top highly expressed genes to check for eventual big sample heterogeneity present in the dataset.
The heatmap shows that the samples cluster first of all by collection point - and that collection point 4 is particularly on its own. Since the blastocysts were also looking a bit less healthy in general in that batch, I first of all exclude collection 4 from the analysis. The number of genotypes I would end up with before and after removing collection 4 can be deduced from the following:
[[1]]
| AUX | UNT | |
|---|---|---|
| F_AID-Ogt | 8 | 8 |
| F_wt | 6 | 3 |
| M_AID-Ogt | 5 | 6 |
| M_wt | 6 | 3 |
[[2]]
| AUX | UNT | |
|---|---|---|
| F_AID-Ogt | 2 | 3 |
| F_wt | 2 | 2 |
| M_AID-Ogt | 3 | 1 |
| M_wt | 1 | 0 |
[[3]]
| AUX | UNT | |
|---|---|---|
| F_AID-Ogt | 3 | 3 |
| F_wt | 3 | 1 |
| M_AID-Ogt | 1 | 2 |
| M_wt | 2 | 2 |
[[4]]
| AUX | UNT | |
|---|---|---|
| F_AID-Ogt | 2 | 0 |
| F_wt | 7 | 2 |
| M_AID-Ogt | 0 | 1 |
| M_wt | 1 | 4 |
[[5]]
| AUX | UNT | |
|---|---|---|
| F_AID-Ogt | 0 | 2 |
| F_wt | 3 | 6 |
| M_AID-Ogt | 0 | 4 |
| M_wt | 2 | 6 |
3.1 PCA
- PCA that uses the ranks of the first 1000 ranks of raw counts, colored by collection batch, to see whether the batch effect is visible.
- The plot shows that there is indeed a batch effect, especially between the first and the fifth collection.
I will try to remove the unwanted variation linked with the batch of collection using package RUVSeq.
3.2 Removing unwanted variation due to batch of collection using RUVSeq
- It would be very complicated for RUVSeq to remove the unwanted variation due to batch effect when mixed with the unwanted variation due to the AUX treatment itself even on wt samples (plot not shown but done). Therefore, I run RUVSeq separately on AUX-treated and untreated samples.
- I choose RUVg, which uses a set of control genes (i.e. genes that do not change because of the biological factor of interest) to compute the unwanted factors.
3.2.1 RUVg effect, untreated blastocysts
3.2.2 RUVg effect, AUX-treated blastocysts
Setting k=6 in the UNT samples and k=3 in the AUX treated samples looks like the minimal number of k in order to have the separation between batches no more visible in the PCA. Let’s confirm this using the PCA of ranks as done above.
3.2.3 RUVg effect using PCA of ranks, untreated blastocysts
3.2.4 RUVg effect using PCA of ranks, AUX-treated blastocysts
3.3 Conclusion
I will perform a DESeq analysis in parallel in the two following ways:
- without removing unwanted variation
- by using a design matrix that includes both the covariates of interest and the factors of unwanted variation computed by RUVg.
The analysis will proceed in the Rmd file DE_analysis_from_featureCounts_post_RUVSeq.Rmd.
sessionInfo()## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] sva_3.42.0 genefilter_1.76.0
## [3] mgcv_1.8-39 nlme_3.1-155
## [5] factoextra_1.0.7 ade4_1.7-20
## [7] RUVSeq_1.28.0 edgeR_3.36.0
## [9] limma_3.50.3 EDASeq_2.28.0
## [11] ShortRead_1.52.0 GenomicAlignments_1.30.0
## [13] Rsamtools_2.10.0 Biostrings_2.62.0
## [15] XVector_0.34.0 BiocParallel_1.28.3
## [17] vsn_3.62.0 ggrepel_0.9.2
## [19] pheatmap_1.0.12 gridExtra_2.3
## [21] ggpubr_0.5.0 DESeq2_1.34.0
## [23] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [25] MatrixGenerics_1.6.0 matrixStats_0.62.0
## [27] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
## [29] IRanges_2.28.0 S4Vectors_0.32.4
## [31] BiocGenerics_0.40.0 data.table_1.14.6
## [33] reshape2_1.4.4 ggplot2_3.4.0
##
## loaded via a namespace (and not attached):
## [1] backports_1.4.1 aroma.light_3.24.0 BiocFileCache_2.2.1
## [4] plyr_1.8.8 splines_4.1.3 digest_0.6.30
## [7] invgamma_1.1 htmltools_0.5.3 SQUAREM_2021.1
## [10] fansi_1.0.3 magrittr_2.0.3 memoise_2.0.1
## [13] annotate_1.72.0 R.utils_2.12.2 prettyunits_1.1.1
## [16] jpeg_0.1-9 colorspace_2.0-3 blob_1.2.3
## [19] rappdirs_0.3.3 xfun_0.35 dplyr_1.0.10
## [22] crayon_1.5.2 RCurl_1.98-1.9 jsonlite_1.8.3
## [25] survival_3.2-13 glue_1.6.2 gtable_0.3.1
## [28] zlibbioc_1.40.0 DelayedArray_0.20.0 car_3.1-1
## [31] abind_1.4-5 scales_1.2.1 DBI_1.1.3
## [34] rstatix_0.7.1 Rcpp_1.0.9 xtable_1.8-4
## [37] progress_1.2.2 bit_4.0.5 preprocessCore_1.56.0
## [40] truncnorm_1.0-8 httr_1.4.4 RColorBrewer_1.1-3
## [43] ellipsis_0.3.2 farver_2.1.1 pkgconfig_2.0.3
## [46] XML_3.99-0.12 R.methodsS3_1.8.2 sass_0.4.2
## [49] dbplyr_2.2.1 deldir_1.0-6 locfit_1.5-9.6
## [52] utf8_1.2.2 labeling_0.4.2 tidyselect_1.2.0
## [55] rlang_1.0.6 AnnotationDbi_1.56.2 munsell_0.5.0
## [58] tools_4.1.3 cachem_1.0.6 cli_3.4.1
## [61] generics_0.1.3 RSQLite_2.2.18 broom_1.0.1
## [64] evaluate_0.18 stringr_1.4.1 fastmap_1.1.0
## [67] yaml_2.3.6 knitr_1.40 bit64_4.0.5
## [70] purrr_0.3.5 KEGGREST_1.34.0 R.oo_1.25.0
## [73] xml2_1.3.3 biomaRt_2.50.3 compiler_4.1.3
## [76] rstudioapi_0.14 filelock_1.0.2 curl_4.3.3
## [79] png_0.1-7 affyio_1.64.0 ggsignif_0.6.4
## [82] tibble_3.1.8 geneplotter_1.72.0 bslib_0.4.1
## [85] stringi_1.7.8 highr_0.9 GenomicFeatures_1.46.5
## [88] lattice_0.20-45 Matrix_1.5-3 vctrs_0.5.1
## [91] pillar_1.8.1 lifecycle_1.0.3 BiocManager_1.30.22
## [94] jquerylib_0.1.4 irlba_2.3.5.1 bitops_1.0-7
## [97] rtracklayer_1.54.0 R6_2.5.1 BiocIO_1.4.0
## [100] latticeExtra_0.6-30 affy_1.72.0 hwriter_1.3.2.1
## [103] codetools_0.2-18 MASS_7.3-55 assertthat_0.2.1
## [106] rjson_0.2.21 withr_2.5.0 GenomeInfoDbData_1.2.7
## [109] parallel_4.1.3 hms_1.1.2 grid_4.1.3
## [112] prettydoc_0.4.1 tidyr_1.2.1 rmarkdown_2.18
## [115] ashr_2.2-54 carData_3.0-5 mixsqp_0.3-48
## [118] interp_1.1-3 restfulr_0.0.15